Red Wines - EDA by Christophe

I have chosen the red wine dataset from a specific region: the Portuguese “Vinho Verde”. Due to privacy and logistic issues, the dataset has only physicochemical (input) and sensory (output) variables. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent). The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Finally, no missing values exist in the dataset.

Input variables:

1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

12 - quality (score between 0 and 10)

Univariate Plots Section

What is the size of the dataset?

## [1] 1599   13

There are 13 variables because the first one (“X”) is like an ID for each
observation.

What do the different variables look like?

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
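As a minimal sketch of how the size and structure above can be inspected, the snippet below rebuilds the first ten rows of a few columns from the str() output and checks them with dim() and str(). The name wine_head is only illustrative; the full data would be read from the CSV file, whose name is not given here.

```r
# First ten observations of five columns, copied from the str() output above.
# wine_head is an illustrative name, not an object from the original analysis.
wine_head <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dims      <- dim(wine_head)            # number of rows, then number of columns
col_types <- sapply(wine_head, class)  # storage class of each column

print(dims)
str(wine_head)
```

On the full dataset the same two calls give the 1599 x 13 shape and the column types listed above.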

Some statistics on the features

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Let’s create a rating variable for easing the plotting:

wine$rating <- ordered(wine$quality, levels = 1:10)
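To illustrate what this ordered factor gives us, here is a small sketch on made-up quality scores (the vector quality below is a toy example, not the real column). Note that with levels 1 to 10 a score of 0 would become NA, which does not matter here because the minimum observed quality is 3.

```r
# Toy quality scores for illustration; the real values are in wine$quality.
quality <- c(5, 5, 6, 7, 3, 8)
rating  <- ordered(quality, levels = 1:10)

levels_kept <- nlevels(rating)        # all ten levels are kept, even unused ones
is_better   <- rating[4] > rating[1]  # ordered factors support comparisons (7 > 5)
counts      <- table(rating)          # unused levels appear with a count of zero

print(counts)
```

Keeping the unused levels means that plots and tables show the whole scale, including the ratings that no wine received.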

Let’s see the number of wines in each quality rating

## [1] 1.125704

The wines with the highest quality account for only about 1.13% of the dataset.
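The percentage can be reproduced with a short sketch, assuming the 18 top-rated wines reported later in this analysis. The other scores in the vector are placeholders, since only the count of the top score matters for this calculation.

```r
# 18 top-rated wines out of 1599, as reported in this analysis.
# The remaining scores are placeholders and do not reflect the real distribution.
quality <- c(rep(8, 18), rep(5, 1599 - 18))

# Share of observations that carry the highest observed score, in percent.
top_share <- 100 * mean(quality == max(quality))
print(top_share)
```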

Do these red wines have a high alcohol content?

Many wines have around 9.5% alcohol. The majority fall in a range of 9% to 12%, which is low for wine. However, the origin of the wines is known and such a result is normal for this region.

Let’s have a look at the pH of the wines

As we could expect, it is in the range of 3 to 3.5. It is odd to have no wine around 3.7 or 2.9, but this is more likely due to missing observations than to an error in the data. The curve looks like a normal distribution.

Let’s see the amount of salt in wine

As we could expect it is quite low.

Let’s check the density of wine, which is related to the residual sugar and alcohol

The curve has a normal shape and most values are below 1, as expected. A density above 1 means that the fermentation was not complete, which might give a bad wine (we will check this assumption later).

Let’s see then the amount of sugar which is also an indicator for the sweetness of the wine

The wines in the dataset are not really sweet, as the majority are below the mean of about 2.5 g/L.

Let’s see the volatile acidity variable

The distribution is slightly positively skewed.

Let’s see the amount of sulphates in the wines

The amount of sulphates is mostly between 0.5 and 1 g/L. This is important information, as sulphates provide SO2 to the wine to prevent oxidation and bacterial proliferation. If there is sufficient free sulfur dioxide, it will lower volatile acidity. We will see this relation in the next section.

Let’s see then the total SO2

We can see some outliers, but the majority is below 100. This is expected, as in Europe the legal limit for red wine (in general) is 150 mg/L.
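One common way to put a number on "outlier" is Tukey's 1.5 x IQR rule. The sketch below applies it to the quartiles of total sulfur dioxide from the summary table; this rule is my assumption for illustration, as the outliers above were only judged from the plot.

```r
# Quartiles and maximum of total.sulfur.dioxide, from the summary table.
q1 <- 22
q3 <- 62
max_observed <- 289

# Tukey's rule: values beyond 1.5 * IQR from the quartiles are flagged.
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

is_outlier <- max_observed > upper_fence
print(c(lower = lower_fence, upper = upper_fence))
```

With these quartiles the upper fence is 122 mg/L, so the maximum of 289 mg/L is well beyond it.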

Let’s see the free SO2 which is important as the molecules will protect the wine

Some statistics about free SO2:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The distribution is right-skewed, with a mean of 15.87 above the median of 14.

Let’s see the amount of citric acid, which adds freshness to the wine

The distribution is positively skewed. Citric acid is used for basic wines but introduces instability in the microbial environment. Because of this defect, winemakers more often use tartaric acid (here, the fixed acidity parameter) to acidify wines.
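Skewness can also be checked numerically. Below is a minimal sketch with a hand-written moment-based skewness function (skewness is a hypothetical helper, not from a package), applied to the first ten free SO2 values from the str() output, so the number it returns is only illustrative.

```r
# Sample skewness (third standardized moment); positive means a longer right tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten free.sulfur.dioxide values from the str() output, for illustration only.
free_so2   <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
skew_value <- skewness(free_so2)

# Quick check on the full data from the summary: mean above median.
mean_above_median <- 15.87 > 14.00

print(skew_value)
```

A positive value, together with a mean above the median, points to a right-skewed distribution.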

Univariate Analysis

What is the structure of your dataset?

The dataset has 1599 observations of 13 variables, plus the rating variable which I created. They are numerical, except for quality and X (integer values) and rating (ordered factor). The data is also tidy.

What is/are the main feature(s) of interest in your dataset?

The main interest is to determine which variables are responsible for a good wine. In the literature, the quality of a wine is based on the level of citric acid, alcohol, pH and residual sugar. I will therefore check these features more closely.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I will check the density, which is an indication of the fermentation process, and the SO2, which is a component that protects the wine against oxidation and microbial activity.

Did you create any new variables from existing variables in the dataset?

Yes, I have created an ordered factor to ease the plotting in some investigations.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the
form of the data? If so, why did you do this?

No, the dataset was tidy and already wrangled. I noticed some skewed distributions and outliers, but nothing that will heavily influence our investigation.

Bivariate Plots Section

Let’s have an overview by plotting the variables against each other

Volatile acidity (VA) is often associated with oxidation problems in a wine due to the fact that both result from overexposure to oxygen and/or a lack of sulfur dioxide management.

The low VA together with the free sulfur dioxide levels could mean that an excess of free SO2 keeps VA at a good level.

In winemaking, the citric-sugar co-metabolism can also increase the formation of volatile acid in wine which can affect the wine aroma negatively if present at excessive levels.

Let’s see the rating with the degree of alcohol

The rating against the sugar would give information about the sweetness of wines

The quantity of SO2 is important as it protects the wine.

Is there a discrepancy in the level of SO2 amongst wines?

There is no huge difference, especially between wines of quality 5 and 6.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$free.sulfur.dioxide and wine$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

This is the strongest positive correlation found with the help of ggpairs. The relationship is interesting but understandable, as free SO2 is included in the total sulfur dioxide.

We would normally expect that more alcohol goes with less residual sugar. Here the sugar remains approximately constant at a low level.
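For reference, a test like the one above is produced with cor.test(). Here is a minimal sketch on the first ten rows from the str() output only, so its numbers will differ from the full-data result.

```r
# First ten free and total SO2 values from the str() output (illustration only).
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Pearson correlation test (the default method of cor.test).
res <- cor.test(free_so2, total_so2)

r  <- unname(res$estimate)   # sample correlation
ci <- res$conf.int           # 95% confidence interval for the correlation
df <- unname(res$parameter)  # degrees of freedom, n - 2

print(res)
```

The degrees of freedom are n - 2, which is why the full dataset of 1599 wines shows df = 1597.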

As we can expect more sugar will increase the density.

As the density is linked with alcohol, let’s see if it is the case

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

As we can see, more alcohol goes with a lower density, which is expected.

The US legal limit for volatile acidity is 1.4 g/L for red wine (1.2 g/L for white wine).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

As I focus on wine quality, I looked more closely at pH, alcohol, sugar and citric acid. I found that wines have a constantly low amount of residual sugar, which means they are not sweet wines and do not require a high level of fixed acidity to balance the sugar. I also observed many outliers in the sugar level, depending on the rating. The highest-rated wines have more alcohol, a higher level of citric acid and a lower pH. They make up only about 1.1% of the dataset, which means either that the wines are not good in general or that some data are missing. The latter is more likely to be true.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

All wines (except some outliers) are below the legal limit for volatile acidity. The level of VA is maintained with a low level of free sulfur dioxide, which is good; otherwise the wine might begin to smell and be considered bad. Finally, the relationship between free SO2 and total SO2 is positive and fairly strong in terms of correlation. We also saw that more alcohol goes with a lower density.

What was the strongest relationship you found?

The strongest positive relationship was between free sulfur dioxide and total sulfur dioxide, with a correlation of 0.67, on par with citric acid and fixed acidity, also at 0.67. The strongest relationship overall, in absolute value, was the negative one between pH and fixed acidity at -0.68.
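A ranking like this can also be obtained without reading the ggpairs plot, by sorting the correlation matrix by absolute value. This is a minimal base R sketch on the first ten rows of five columns from the str() output (wine_head is an illustrative name), so the ranking it prints is not the full-data one.

```r
# First ten rows of five columns, copied from the str() output (illustration only).
wine_head <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH                   = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

cors <- cor(wine_head)
cors[upper.tri(cors, diag = TRUE)] <- NA   # keep each pair of variables once

# Reshape to one row per pair and sort by absolute correlation, strongest first.
pairs <- as.data.frame(as.table(cors))
pairs <- pairs[!is.na(pairs$Freq), ]
pairs <- pairs[order(-abs(pairs$Freq)), ]
names(pairs) <- c("var1", "var2", "r")

print(head(pairs, 3))
```

Sorting on the absolute value keeps strong negative correlations, such as pH against fixed acidity, at the top of the list.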

Multivariate Plots Section

Let’s see the different variables of interest against quality

First, let’s see the sugar versus the degree of alcohol

Another way to see the same data

As we can see, the better the wine the lower the sugar level.

Now let’s see whether the amount of SO2 present and the alcohol are related

We have limited the axes and the wine quality (6 and above), as the majority of the data are within these limits.

Let’s see whether something appears when making different plots with combinations of alcohol, sugar and fixed acidity

Even if these 3 parameters are linked to the quality of the wine, these 3 plots are not helpful.

Let’s see with the volatile acidity

As expected, lower quality wines have higher volatile acidity

Finally let’s see if the pH is related to SO2

Let’s create a model to estimate the level of alcohol

library(memisc)  # provides mtable() for side-by-side model summaries

# Start from a single predictor and add one variable at a time
m1 <- lm(alcohol ~ fixed.acidity, data = wine)
m2 <- update(m1, ~ . + pH)
m3 <- update(m2, ~ . + residual.sugar)
m4 <- update(m3, ~ . + citric.acid)
m5 <- update(m4, ~ . + density)
m6 <- update(m5, ~ . + chlorides)
m7 <- update(m6, ~ . + sulphates)
mtable(m1, m2, m3, m4, m5, m6, m7)
## 
## Calls:
## m1: lm(formula = alcohol ~ fixed.acidity, data = wine)
## m2: lm(formula = alcohol ~ fixed.acidity + pH, data = wine)
## m3: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar, data = wine)
## m4: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar + 
##     citric.acid, data = wine)
## m5: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar + 
##     citric.acid + density, data = wine)
## m6: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar + 
##     citric.acid + density + chlorides, data = wine)
## m7: lm(formula = alcohol ~ fixed.acidity + pH + residual.sugar + 
##     citric.acid + density + chlorides + sulphates, data = wine)
## 
## ====================================================================================================================
##                        m1            m2            m3            m4            m5            m6            m7       
## --------------------------------------------------------------------------------------------------------------------
##   (Intercept)        10.737***      2.667**       2.579**       1.909*      607.523***    611.109***    614.217***  
##                      (0.130)       (0.887)       (0.887)       (0.863)      (12.699)      (13.251)      (12.759)    
##   fixed.acidity      -0.038*        0.090***      0.087***     -0.024         0.560***      0.567***      0.571***  
##                      (0.015)       (0.020)       (0.020)       (0.023)       (0.019)       (0.020)       (0.019)    
##   pH                                2.115***      2.120***      2.469***      3.934***      3.981***      3.954***  
##                                    (0.230)       (0.230)       (0.226)       (0.148)       (0.156)       (0.150)    
##   residual.sugar                                  0.039*        0.023         0.267***      0.268***      0.276***  
##                                                  (0.019)       (0.018)       (0.013)       (0.013)       (0.012)    
##   citric.acid                                                   1.785***      0.833***      0.810***      0.529***  
##                                                                (0.177)       (0.115)       (0.118)       (0.116)    
##   density                                                                  -617.700***   -621.535***   -625.188***  
##                                                                             (12.940)      (13.559)      (13.056)    
##   chlorides                                                                                 0.360        -0.968*    
##                                                                                            (0.380)       (0.384)    
##   sulphates                                                                                               1.154***  
##                                                                                                          (0.102)    
## --------------------------------------------------------------------------------------------------------------------
##   R-squared           0.004         0.054         0.057         0.113         0.635         0.635         0.662     
##   adj. R-squared      0.003         0.053         0.055         0.111         0.634         0.634         0.661     
##   sigma               1.064         1.037         1.036         1.005         0.645         0.645         0.621     
##   F                   6.097        45.476        31.892        50.849       554.550       462.245       445.701     
##   p                   0.014         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood  -2367.035     -2325.770     -2323.506     -2274.067     -1564.049     -1563.598     -1502.210     
##   Deviance         1807.863      1716.921      1712.066      1609.403       662.182       661.809       612.895     
##   AIC              4740.070      4659.541      4657.013      4560.134      3142.098      3143.196      3022.420     
##   BIC              4756.201      4681.049      4683.898      4592.397      3179.738      3186.213      3070.814     
##   N                1599          1599          1599          1599          1599          1599          1599         
## ====================================================================================================================
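As a quick sanity check on the table, the adjusted R-squared can be recomputed from the R-squared, the number of observations and the number of predictors. The sketch below uses the rounded values printed above, so the results are approximate; adj_r2 is a hypothetical helper name.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# with n observations and p predictors (intercept not counted).
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1599
m1_adj <- adj_r2(0.004, n, p = 1)  # m1: fixed.acidity only
m7_adj <- adj_r2(0.662, n, p = 7)  # m7: all seven predictors

print(round(c(m1 = m1_adj, m7 = m7_adj), 3))
```

With 1599 observations the penalty for 7 predictors is tiny, which is why the adjusted values stay so close to the raw R-squared.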

Alcohol should be 9.4 (as per our dataset) for the following input:

thisWine = data.frame(pH = 3.51, fixed.acidity = 7.4,chlorides=0.076, 
                         sulphates=0.56,residual.sugar = 1.9, 
                         citric.acid = 0,density = 0.9978)
modelEstimate = predict(m7, newdata = thisWine,
                        interval="prediction", level = .95)
# alcohol was modelled on its original scale, so no exp() back-transform is needed
round(modelEstimate, 2)
##   fit  lwr   upr
## 1 9.6 8.38 10.82
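A note on the scale of predictions: predict() returns values on the scale of the response used in the model formula, so exp() is only appropriate when the response was log-transformed. The sketch below shows both cases on simulated data (x, y and the coefficients are made up for illustration).

```r
set.seed(1)

# Simulated data: y depends linearly on x, with a little noise.
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.3)
new_obs <- data.frame(x = 5)

# Case 1: response on its original scale; the prediction needs no transformation.
fit_raw <- lm(y ~ x)
pred <- predict(fit_raw, newdata = new_obs,
                interval = "prediction", level = 0.95)

# Case 2: log-transformed response; exp() brings the prediction back to y's scale.
fit_log  <- lm(log(y) ~ x)
pred_log <- exp(predict(fit_log, newdata = new_obs,
                        interval = "prediction", level = 0.95))

print(pred)
print(pred_log)
```

In both cases the result has the columns fit, lwr and upr, and the interval should bracket plausible values of the response itself.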

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Great wines are in balance with their 4 fundamental traits (acidity, tannin, alcohol and sweetness).

The number of wines of great quality in our dataset is 18 and they have an alcohol content of around 12%. We can see that good wines have a higher percentage of SO2. The level of volatile acidity is low when the degree of alcohol is higher. The levels of sugar are nearly the same for each category of wine, but I observed a slightly higher one for the top range. Maybe this adds some complexity to the wine and was therefore more appealing to the tasters. Nevertheless, the result is expected when we know which kind of wine has been measured, as it is typical of this region of Portugal.

Were there any interesting or surprising interactions between features?

Yes, even if 3 parameters (residual sugar, alcohol and fixed acidity) are linked to the quality of the wine, it does not mean that a plot with these features together would be helpful. By extension, it is not surprising that the model we fitted did not perform well.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I tried to build a model to predict the level of alcohol in a wine based on the physicochemical variables. However, the R squared is at best 0.66 with 7 variables, which is low. Based on this result, the model is not suitable for the purpose. Most of the gain came from adding density (R squared rose from 0.11 to 0.63); the other variables did not improve the R squared significantly, so I didn’t search further.


Final Plots and Summary

Plot One

Description One

This graph comes from the univariate section and shows that many wines in the dataset have a low alcohol degree. This is normal for “Vinho Verde” wines and it was the start of the search to understand which variables are important for a good wine.

Plot Two

Description Two

This graph comes from the bivariate section. It shows the mean alcohol for each category of wine compared to the blue line, which is the mean alcohol for all wines. We can see that better wines have higher alcohol.
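The numbers behind such a plot are a mean per quality category and an overall mean. Here is a minimal sketch on the first ten rows from the str() output, so the values are only illustrative and not those of the full dataset.

```r
# First ten alcohol and quality values from the str() output (illustration only).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Mean alcohol within each quality score, and the mean over all wines.
mean_by_quality <- tapply(alcohol, quality, mean)
overall_mean    <- mean(alcohol)

# TRUE for the categories whose mean lies above the overall mean.
above_overall <- mean_by_quality > overall_mean

print(round(mean_by_quality, 2))
print(overall_mean)
```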

Plot Three

Description Three

This graph comes from the last section. It shows that all wines, independently of their rating and degree of alcohol, have low sugar. These wines are therefore not really sweet.


Reflection

The dataset has 1599 observations of 12 variables. The dataset is tidy but concentrated on medium-quality wines, i.e. mainly ratings 5-6. It is then difficult to have sufficient data to understand whether a specific variable adds something to the quality of the wines tested. Moreover, the rating is subjective and, even if the physical components of a wine are in the right range, their combination might not be ideal. Thus vinification is still an art rather than a science.

Other factors could also have led to different results. It would have been nice to have the wines’ prices, when the grapes were harvested (a later harvest would give more sugar), the kind of soil where they grew and which grapes are in the wines, amongst others.

The model we fitted was not accurate enough (R squared of 0.66), even though the degree of alcohol should have been possible to predict; I guess some additional parameters from the fermentation process have to be taken into account.

References:

http://wineserver.ucdavis.edu/industry/enology/methods_and_techniques/techniques/ph_analysis.html

https://winefolly.com/review/understanding-acidity-in-wine/

https://vinepair.com/wine-blog/7-things-you-need-to-know-about-vinho-verde/

http://winemakersacademy.com/potassium-metabisulfite-additions/

http://www.diwinetaste.com/dwt/en2007026.php

https://grapesandwine.cals.cornell.edu/sites/grapesandwine.cals.cornell.edu/files/shared/documents/Research-Focus-2011-3.pdf